Final Project: Volcanic Eruptions Analysis¶

Course: Programming for Data Science (MESIIN471625) Team Members:

  • Isabela MORA (Leader)
  • Mathieu MAURY
  • Paul MILLIEN
  • Yannix MICHOUX

Dataset: Volcanic Eruptions (NOAA)

Introduction & Goal¶

The goal of this project is to apply the Knowledge Discovery in Databases (KDD) process to extract meaningful insights from historical volcanic eruption data (from 4360 BC to the present). We will implement a visualization dashboard displaying four key indicators, using techniques such as data discretization, temporal trend analysis (regression), and spatial clustering (K-Means).

All code is encapsulated in functions for modularity and reusability.

Environment Setup¶

Objective: Import all necessary libraries to support the KDD process (Data Cleaning, Mining, and Visualization).

Technical Decisions (Library Selection):

  • pandas & numpy: Selected for their efficiency in handling tabular data and vector operations. They are the standard for the "Preprocessing" phase.
  • plotly & dash: Chosen over matplotlib because the project requires an interactive dashboard (zooming, filtering). Static images would not meet the "Dashboard" requirement effectively.
  • sklearn (Scikit-Learn): Used for the "Data Mining" phase. We specifically need:
    • KMeans: For unsupervised clustering (spatial analysis).
    • LinearRegression: For trend analysis (temporal bias).
  • warnings: Configured to ignore. Reason: During the final presentation, we want a clean output without clutter from deprecation warnings (common with Plotly/Pandas versions).
In [1]:
# Import necessary libraries

# 1. Data Manipulation & Linear Algebra
import pandas as pd # Core library for dataframes (tables)
import numpy as np # Optimized numerical operations

# 2. System & Files
import os            # To manage file paths (OS independent)
import kagglehub     # (Optional) To fetch external datasets automatically

# 3. Interactive Visualization (The Dashboard Engine)
import plotly.express as px       # High-level interface for quick charts
import plotly.graph_objects as go # Low-level interface for custom charts (needed for dual-axis)
from dash import Dash, dcc, html, Input, Output # Components for the web application

# 4. Machine Learning (Data Mining)
from sklearn.cluster import KMeans             # Unsupervised Learning (for Tectonic Zones)
from sklearn.linear_model import LinearRegression # Supervised Learning (for Temporal Trend)
from sklearn.preprocessing import MinMaxScaler # For normalizing data (if needed)

# 5. Presentation Settings
import warnings
# Suppress warnings to ensure a professional, clean look during the demo.
# (ex: avoids "SettingWithCopyWarning" cluttering the dashboard output)

# Suppress warnings for a cleaner notebook
warnings.filterwarnings('ignore')

# Force Pandas to display all columns. 
# This is crucial during the 'Exploration' phase to see all attributes of the dataset.
pd.set_option('display.max_columns', None)

print("Libraries succesfully imported")
Libraries succesfully imported

1. Data Cleaning & Preprocessing (Step 1 of KDD)¶

Objective: Transform raw data into a reliable format for analysis.

Decisions & Rationale:

  • Handling Missing Years: We drop rows where the 'Year' is missing. Time is a critical dimension for our analysis; an event without a date is statistically useless for trend analysis.
  • Handling Deaths (NaN): Missing values in the 'Deaths' column are replaced by 0.
    • Assumption: In historical records, if no death count is mentioned, it generally implies no casualties occurred or the event was minor. Leaving them as NaN would break the aggregation functions.
  • Type Conversion: Columns like Year, Latitude, and Longitude are forced to numeric types to prevent string-based errors during plotting.

Data Loading¶

Objective: Import the raw dataset into the Python environment to begin the analysis.

Key Decisions & Rationale:

  • File Format Handling: The dataset is a .tsv (Tab-Separated Values) file, not a standard CSV. We explicitly define the separator as \t to parse columns correctly.
  • Encoding Strategy: We use latin1 encoding instead of the standard utf-8.
    • Reason: Historical geological datasets often contain special characters (e.g., accents in names like Eyjafjallajökull or Popocatépetl). Standard UTF-8 reading often fails or corrupts these names. latin1 is more robust for Western European languages found in this dataset.
  • Robustness: We utilize on_bad_lines='skip'. In older datasets, human formatting errors (like an extra tab) can crash the loading process. We decided it is better to skip a single corrupted row than to halt the entire analysis.
In [2]:
def load_data(file_path):
    """
    Function : Loads data from a TSV file into a Pandas DataFrame
    
    Input :
        file_path (str): Path to the dataset file.
        
    Output :
        pd.DataFrame: The loaded dataframe.
    """
    try:
        # The dataset is Tab-Separated Values (TSV), we use 'read_csv' but with separator '\t'
        # We use latin1 encoding because historical datasets often contain special characters
        # 'on_bad_lines=skip' prevents the script from crashing if one row is malformed.
        df = pd.read_csv(file_path, sep='\t', header=0, encoding='latin1', on_bad_lines = 'skip') 
        print(f"Data loaded successfully: {df.shape}")
        
        return df
    except Exception as e:
        # General error catching
        print(f"Error loading data: {e}")
        
        return None

Exploratory Data Analysis (EDA)¶

Objective: Gain a global overview of the dataset's "health" before applying any transformations. This step helps identify data quality issues (sparsity, outliers, duplicates).

Decisions & Smart Assumptions:

  • Missing Values Analysis:
    • Anticipation: We expect a high percentage of missing values in the Deaths and VEI columns.
    • Reasoning: Historical records from 2000 BC naturally lack precise casualty counts or scientific measurements (VEI). This justifies our later decision to impute deaths with 0 (assuming no record = minor event) rather than dropping them.
  • Statistical Scope:
    • Decision: We limit the statistical summary (describe) to specific columns: Year, VEI, Deaths, Latitude, Longitude.
    • Reasoning: The raw dataset contains many text-based columns (Descriptions, Agents) which are irrelevant for numerical statistics. We focus on the dimensions of Time, Power, Impact, and Space.
  • Integrity Check: Checking for duplicates prevents "double-counting" eruptions, which would bias our frequency analysis.
In [3]:
def summarize_data(df):
    """
    Function : Explores and summarizes the dataset (Preview, Types, Missing Values, Stats).
    Adapted from standard Exploratory Data Analysis practices in previous labs.
    
    Input:
        df (pd.DataFrame): The dataframe to analyze.
        
    Output:
        None (Prints summary to console). Returns df for chaining if needed..
    """
    print("\nData Summarization")
    
    # Visual Preview
    # Check to see if data loaded correctly (headers aligned, special chars, etc.)
    print("\nFirst 2 Rows")
    print(df.head(2))

    # Structural Info
    print("\nColumn Names")
    print(df.columns.tolist())
    
    # Important to verify that numeric columns (Year, VEI) are not objects (strings)
    print("\nData Types")
    print(df.dtypes)

    print("\nShape")
    print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

    # Data Integrity (Missing & Duplicates)
    print("\nMissing Values")
    missing = df.isnull().sum()
    missing_percent = (missing / len(df)) * 100
    # Create a small table for missing values
    missing_table = pd.DataFrame({'Missing Count': missing, 'Percentage': missing_percent})
    # Only show columns that actually have missing values
    print(missing_table[missing_table['Missing Count'] > 0].sort_values(by='Missing Count', ascending=False))

    print("\nDuplicates")
    duplicates = df.duplicated().sum()
    print(f"Total Duplicated Rows: {duplicates}")

    # Statistical Summary (usefull values)
    print("\nKey Numerical Statistics")
    
    # Only interesting columns for analysis
    useful_cols = ['Year', 'VEI', 'Deaths', 'Latitude', 'Longitude']
    
    # We first verify that they exist to avoid errors
    existing_cols = [col for col in useful_cols if col in df.columns]
    
    # Rounded to 2 decimals
    stats = df[existing_cols].describe().T.round(2)
    print(stats)
    
    return df

Data Cleaning & Transformation (Preprocessing)¶

Objective: Convert the raw, "dirty" data (strings, missing values, formatting errors) into a clean, numerical dataset ready for algorithms.

Critical Decisions & Rationales:

  • Safety First: We immediately create a .copy() of the dataframe. Reason: In Pandas, modifying a slice of a dataframe often triggers a SettingWithCopyWarning. Working on an explicit copy prevents this and keeps the original loaded data intact for reference.
  • Handling "Total Deaths" vs "Deaths": The dataset distinguishes between direct deaths (eruption) and total deaths (including secondary effects like tsunamis). We clean both columns to allow for a comprehensive impact analysis later.
  • Imputation Strategy (The "Zero" Assumption):
    • Problem: Many historical rows have empty 'Deaths' fields.
    • Decision: We replace these NaN values with 0.
    • Justification: In historical records, the absence of a casualty report almost always implies zero casualties, not "unknown" casualties. Leaving them as NaN would force us to drop thousands of benign eruptions, creating a massive "Severity Bias" (we would only keep the deadly ones).
  • Filtering Strategy:
    • Decision: Rows with missing Year, Latitude, or Longitude are aggressively dropped.
    • Reason: An event cannot be plotted on a timeline (Trend Analysis) or a map (Clustering) without these keys. Interpolating them would be scientifically dishonest.
In [4]:
def data_cleaning(df):
    """
    Function : Performs data cleaning operations including numeric conversion, 
               handling missing values (dropping invalid years/locations), and 
               imputing zeros for deaths. Uses robust approach (non-destructive + flexible
               column names )
    
    Input :
        df (pd.DataFrame): The raw dataset containing volcano data.
        
    Output :
        pd.DataFrame: The cleaned dataframe with standardized numeric columns, 
                      ready for analysis.
    """
    # 1. SAFETY COPY: We work on a copy to protect the original variable
    # and avoid "SettingWithCopyWarning"
    df = df.copy()

    # Year cleanse 
    # Essential for temporal analysis. Rows without year are dropped.
    # Convert numeric columns, coercing errors to NaN (e.g., if 'Year' has a typo like '1990?')
    df['Year'] = pd.to_numeric(df['Year'], errors='coerce')

    # VEI cleanse (Volcanic Explosivity Index)
    # We ensure they are numeric. ' ' becomes NaN.
    df['VEI'] = pd.to_numeric(df['VEI'], errors='coerce')

    # Deaths cleanse
    # If there's no number, we assume 0 to ease summing/calculating
    # We use fillna(0) because we can't compute with Nan
    df['Deaths'] = pd.to_numeric(df['Deaths'], errors='coerce').fillna(0)
    df['Total Deaths'] = pd.to_numeric(df['Total Deaths'], errors='coerce').fillna(0)
    
    # Geographic cleanse 
    # Necessary for mapping. Rows without coordinates are dropped.
    df['Latitude'] = pd.to_numeric(df['Latitude'], errors='coerce')
    df['Longitude'] = pd.to_numeric(df['Longitude'], errors='coerce')
    
    df = df.dropna(subset=['Year', 'Latitude', 'Longitude'])

    print(f'Data cleaned : {df.shape}')

    return df

Global Static Analysis (Visual EDA)¶

Objective: Before diving into specific interactive indicators, we generate a static summary dashboard to understand the distributions and relationships within the data.

Key Questions & Assumptions:

  • Physical Analysis (Types):
    • Assumption: We expect Stratovolcanoes to be the dominant type. They are the classic cone-shaped volcanoes typically found in subduction zones (like the Ring of Fire).
  • Power Analysis (VEI):
    • Assumption: The distribution should follow a Power Law. Small eruptions (VEI 0-2) are extremely frequent, while cataclysmic events (VEI 6-7) are statistically rare outliers.
  • Human Analysis (Agents):
    • Question: What actually kills people? Is it lava?
    • Insight: We analyze the "Agent" code. Historically, Pyroclastic flows (P) and Tsunamis (T) are far deadlier than lava flows (L), which move slowly enough for people to evacuate.
  • Geographical Analysis:
    • Goal: A simple scatter plot of Latitude vs. Longitude should visually reconstruct the Earth's tectonic plate boundaries without any external shapefile.
In [5]:
import matplotlib.pyplot as plt
import seaborn as sns

def full_volcano_analysis(df):
    """
    Function : Full volcanoes analysis : Physical, Temporal, Human and Geographical.

    Input : the full dataset

    Output : None, it just displays the information 
    """
    # Graphs style configuration
    sns.set_theme(style="whitegrid")
    fig, axes = plt.subplots(2, 2, figsize=(20, 15)) # 4 graphs grid
    
    # Physical Analysis : Which volcano type is the most common ?
    # We take the 10 most important types
    top_types = df['Type'].value_counts().nlargest(10).index
    sns.countplot(y='Type', data=df[df['Type'].isin(top_types)], 
                  order=top_types, ax=axes[0, 0], palette='magma')
    axes[0, 0].set_title('Top 10 Volcano Types')
    axes[0, 0].set_xlabel('Registered eruptions count')

    # Temporal and power analysis : VEI evolution
    # Histogram of VEI. We use bins=8 because VEI is a discrete scale from 0 to 8.
    sns.histplot(data=df, x='VEI', bins=8, kde=False, color='orange', ax=axes[0, 1])
    axes[0, 1].set_title('VEI distribution')
    axes[0, 1].set_xlabel('VEI (0 = Weak, 7-8 = Cataclysmic)')
    
    # Human analysis : What is the cause of death ?
    # Agent column contains codess (T=Tsunami, P=Pyroclastic, etc.)
    # We take the 10 most common causes
    top_agents = df['Agent'].value_counts().nlargest(10).index
    sns.barplot(x=df['Agent'].value_counts()[top_agents], y=top_agents, ax=axes[1, 0], palette='Reds_r')
    axes[1, 0].set_title('Main eruption causes')
    axes[1, 0].set_ylabel('Agent code (P, T, M...)')

    # Geographical analysis : Volcanoes map
    # Logic: Scatter plot of Lon/Lat. 
    # - Hue (Color) = VEI: Shows where the most violent eruptions occur.
    # - Size = VEI: Bigger dots for bigger eruptions.
    sns.scatterplot(x='Longitude', y='Latitude', hue='VEI', data=df, 
                    palette='viridis', size='VEI', sizes=(20, 200), alpha=0.7, ax=axes[1, 1])
    axes[1, 1].set_title('Eruptions global repartition (Size = VIE)')
    axes[1, 1].set_xlim(-180, 180)
    axes[1, 1].set_ylim(-90, 90)
    axes[1, 1].legend(loc='lower left', title='VEI')

    # Adjust layout to prevent overlap
    plt.tight_layout()
    plt.show()

    # Textual statistics
    print("\n")
    print("Key stats")
    print("\n")
    
    # A. Death : Direct vs Total
    # This highlights that the eruption itself isn't always the main killer (e.g., Tsunami).
    total_deaths = df['Total Deaths'].sum()
    direct_deaths = df['Deaths'].sum()
    indirect_deaths = total_deaths - direct_deaths
    
    print(f"   Total Deaths (Historical) : {int(total_deaths):,}")
    print(f"   - Directs (Lava, Ashes...) : {int(direct_deaths):,} ({direct_deaths/total_deaths:.1%})")
    print(f"   - Indirects (Tsunamis, Earthquakes) : {int(indirect_deaths):,} ({indirect_deaths/total_deaths:.1%})")
    
    # B. Descriptions (codes importance)
    print("\nAnalysis of descriptive codes (Deaths):")
    print(df['Death Description'].value_counts().sort_index().rename({
        1: '1: Few (<50 deaths)', 
        2: '2: Some (51-100 deaths)', 
        3: '3: Many (101-1000 deaths)', 
        4: '4: Very Many (>1000 deaths)'
    }))
In [6]:
# Functions test
file_path = "volcano-events-2024-12-17_11-00-15_+0100.tsv"
df = load_data(file_path)
df = summarize_data(df)
df_clean = data_cleaning(df)
df_clean.head(3)
full_volcano_analysis(df)
Data loaded successfully: (888, 35)

Data Summarization

First 2 Rows
  Search Parameters    Year  Mo  Dy  Tsu  Eq      Name     Location  \
0                []     NaN NaN NaN  NaN NaN       NaN          NaN   
1               NaN -4360.0 NaN NaN  NaN NaN  Macauley  Kermadec Is   

       Country  Latitude  Longitude  Elevation (m)     Type  VEI Agent  \
0          NaN       NaN        NaN            NaN      NaN  NaN   NaN   
1  New Zealand    -30.21   -178.475          238.0  Caldera  6.0   NaN   

   Deaths  Death Description  Missing  Missing Description  Injuries  \
0     NaN                NaN      NaN                  NaN       NaN   
1     NaN                NaN      NaN                  NaN       NaN   

   Injuries Description  Damage ($Mil)  Damage Description  Houses Destroyed  \
0                   NaN            NaN                 NaN               NaN   
1                   NaN            NaN                 NaN               NaN   

   Houses Destroyed Description  Total Deaths  Total Death Description  \
0                           NaN           NaN                      NaN   
1                           NaN           NaN                      NaN   

   Total Missing  Total Missing Description  Total Injuries  \
0            NaN                        NaN             NaN   
1            NaN                        NaN             NaN   

   Total Injuries Description  Total Damage ($Mil)  Total Damage Description  \
0                         NaN                  NaN                       NaN   
1                         NaN                  NaN                       NaN   

   Total Houses Destroyed  Total Houses Destroyed Description  
0                     NaN                                 NaN  
1                     NaN                                 NaN  

Column Names
['Search Parameters', 'Year', 'Mo', 'Dy', 'Tsu', 'Eq', 'Name', 'Location', 'Country', 'Latitude', 'Longitude', 'Elevation (m)', 'Type', 'VEI', 'Agent', 'Deaths', 'Death Description', 'Missing', 'Missing Description', 'Injuries', 'Injuries Description', 'Damage ($Mil)', 'Damage Description', 'Houses Destroyed', 'Houses Destroyed Description', 'Total Deaths', 'Total Death Description', 'Total Missing', 'Total Missing Description', 'Total Injuries', 'Total Injuries Description', 'Total Damage ($Mil)', 'Total Damage Description', 'Total Houses Destroyed', 'Total Houses Destroyed Description']

Data Types
Search Parameters                      object
Year                                  float64
Mo                                    float64
Dy                                    float64
Tsu                                   float64
Eq                                    float64
Name                                   object
Location                               object
Country                                object
Latitude                              float64
Longitude                             float64
Elevation (m)                         float64
Type                                   object
VEI                                   float64
Agent                                  object
Deaths                                float64
Death Description                     float64
Missing                               float64
Missing Description                   float64
Injuries                              float64
Injuries Description                  float64
Damage ($Mil)                         float64
Damage Description                    float64
Houses Destroyed                      float64
Houses Destroyed Description          float64
Total Deaths                          float64
Total Death Description               float64
Total Missing                         float64
Total Missing Description             float64
Total Injuries                        float64
Total Injuries Description            float64
Total Damage ($Mil)                   float64
Total Damage Description              float64
Total Houses Destroyed                float64
Total Houses Destroyed Description    float64
dtype: object

Shape
Rows: 888, Columns: 35

Missing Values
                                    Missing Count  Percentage
Search Parameters                             887   99.887387
Total Missing                                 876   98.648649
Missing                                       876   98.648649
Missing Description                           874   98.423423
Total Missing Description                     873   98.310811
Damage ($Mil)                                 864   97.297297
Total Damage ($Mil)                           860   96.846847
Houses Destroyed                              838   94.369369
Total Houses Destroyed                        829   93.355856
Eq                                            809   91.103604
Injuries                                      787   88.626126
Total Injuries                                784   88.288288
Injuries Description                          764   86.036036
Houses Destroyed Description                  760   85.585586
Total Injuries Description                    754   84.909910
Total Houses Destroyed Description            733   82.545045
Tsu                                           706   79.504505
Damage Description                            640   72.072072
Total Damage Description                      619   69.707207
Deaths                                        445   50.112613
Total Deaths                                  423   47.635135
Agent                                         364   40.990991
Death Description                             318   35.810811
Total Death Description                       292   32.882883
Dy                                            192   21.621622
VEI                                           183   20.608108
Mo                                            132   14.864865
Year                                            1    0.112613
Type                                            1    0.112613
Latitude                                        1    0.112613
Elevation (m)                                   1    0.112613
Location                                        1    0.112613
Country                                         1    0.112613
Longitude                                       1    0.112613
Name                                            1    0.112613

Duplicates
Total Duplicated Rows: 0

Key Numerical Statistics
           count     mean      std      min      25%      50%      75%  \
Year       887.0  1733.90   713.80 -4360.00  1800.00  1926.00  1988.00   
VEI        705.0     2.87     1.30     0.00     2.00     3.00     4.00   
Deaths     443.0   438.86  2413.86     1.00     1.00     5.00    48.50   
Latitude   887.0    15.06    25.66   -63.00    -6.51    13.26    36.41   
Longitude  887.0    53.09   101.19  -178.48   -19.67   110.45   130.86   

                max  
Year        2024.00  
VEI            7.00  
Deaths     30000.00  
Latitude      65.72  
Longitude    177.18  
Data cleaned : (887, 35)
No description has been provided for this image

Key stats


   Total Deaths (Historical) : 334,023
   - Directs (Lava, Ashes...) : 194,414 (58.2%)
   - Indirects (Tsunamis, Earthquakes) : 139,609 (41.8%)

Analysis of descriptive codes (Deaths):
Death Description
1: Few (<50 deaths)            409
2: Some (51-100 deaths)         42
3: Many (101-1000 deaths)       81
4: Very Many (>1000 deaths)     38
Name: count, dtype: int64

Data Analysis & Exploration Report¶

1. Data Structure & Health¶

  • Initial Load: The dataset was successfully loaded with 888 rows and 35 columns.
  • Cleaning Action: The cleaning process identified and removed 1 row (likely the first row seen in the preview, which contained only NaN or empty lists). The final clean dataset contains 887 valid eruptions.
  • Duplicates: There are 0 duplicates, ensuring that we are not double-counting any historical events.

2. Missing Value Diagnosis¶

A significant portion of the data analysis involves handling missing values. The report reveals three categories of data availability:

  • High Completeness (>99%): Key identifiers like Year, Name, Country, Latitude, and Longitude are almost perfectly preserved. This confirms that our Temporal and Spatial analyses will be statistically robust.

  • Moderate Missingness (~20%): The VEI (Volcanic Explosivity Index) is missing in roughly 20% of cases. This is expected for ancient eruptions where geological evidence may be eroded.

  • High Missingness (~50%): The Deaths column is missing in half the records.

    • Interpretation: In the context of disaster datasets, a missing value for casualties typically implies Zero Deaths rather than unknown data. This validates our cleaning strategy to impute these NaN values with 0.

3. Statistical Insights¶

A. Temporal Distribution

  • Range: The dataset covers a massive timespan from 4360 BC to 2024 AD.
  • Recency Bias: The median year is 1926, and the mean is 1733. This indicates a heavy skew towards modern times. This is not because volcanoes are erupting more often, but because human recording capabilities have improved.

B. Physical Power (VEI)

  • Average Intensity: The mean VEI is 2.87, suggesting that the "average" recorded eruption is moderate (similar to Mount St. Helens' smaller blasts).
  • Max Intensity: The maximum recorded VEI is 7.0 (Cataclysmic), which likely corresponds to events like the Tambora eruption.

C. Human Impact Analysis (The "Killer" Stats) The summary statistics reveal a stark picture of volcanic lethality:

  • Total Human Toll: 334,023 deaths are recorded in this dataset.
  • The Lethality Paradox:
    • Direct Deaths (58.2%): Caused directly by the eruption (Pyroclastic flows, gas, lava, ash).
    • Indirect Deaths (41.8%): Caused by secondary effects (Tsunamis, Earthquakes).
    • Insight: You have nearly a 40% chance of being killed by the aftermath of a volcano (like a Tsunami) rather than the explosion itself. This highlights the importance of multi-hazard warning systems.

D. Severity Distribution (Pareto Principle) The Death Description codes confirm that volcanic lethality follows a "Power Law" or Pareto distribution:

  • Frequent but Minor: 409 events (Code 1) caused fewer than 50 deaths.
  • Rare but Catastrophic: Only 38 events (Code 4) caused more than 1,000 deaths.
  • Conclusion: The vast majority of eruptions are survivable. The massive death toll (334k) is driven by a tiny handful of "Black Swan" events.

2. Indicator: Mortality Analysis (Aggregation)¶

Human Description: This indicator aggregates the total number of deaths per country. It answers the question: "Which nations have paid the highest price in history due to volcanic activity?"

Interpretation & Assumptions:

  • We expect countries located on the "Ring of Fire" (Indonesia, Japan, Philippines) to dominate this ranking.
  • Observation: Indonesia is likely to be #1. This is not just due to the number of volcanoes, but due to population density living in close proximity to active cones (Java/Sumatra).
  • Limitation: This indicator is biased by population size. A large country naturally has more potential victims. This justifies our later need for Data Enrichment (Step 4).

Indicator 1: Mortality Analysis (Aggregation & Feature Engineering)¶

Objective: Identify the countries most severely impacted by volcanic eruptions throughout history and distinguish the nature of the fatality.

Methodology (Feature Engineering):

  • Derived Attribute: We create a new column Indirect Deaths by subtracting direct deaths from the total.
    • Reason: This allows us to separate victims of the explosion itself (Lava, Ash, Gas) from victims of secondary effects (Tsunamis, Mudslides, Famine).
  • Aggregation: We group by Country and sum the fatalities.

Analysis & Smart Assumptions:

  • The "Ring of Fire" Dominance: We anticipate that Indonesia and Japan will top this list. This is not just due to the number of volcanoes, but the Population Density living on their slopes.
  • The "Tsunami" Factor: Countries like Martinique or Krakatoa (Indonesia) might show huge "Total Death" counts driven largely by "Indirect Deaths" (Tsunamis), whereas continental volcanoes might have higher "Direct Deaths". This distinction is critical for risk management.
In [7]:
# Indicator 1 : mortality by country 
def indicator_mortality(df):
    """
    Function : Aggregates deaths per country, separating Direct vs Indirect causes.
               (Indirect = Total - Direct).
    
    Input :
        df (pd.DataFrame): The cleaned dataset.
        
    Output :
        pd.DataFrame: Top 10 countries with 'Total Deaths', 'Direct Deaths', 
                      and 'Indirect Deaths'.
    """    
    # Creation of Indirect deaths columns
    # Indirect = Total - direct
    # This reveals the hidden cost of eruptions
    df['Indirect Deaths'] = df['Total Deaths'] - df['Deaths']
    
    # Groupby country on the 3 metrics
    cols_to_sum = ['Total Deaths', 'Deaths', 'Indirect Deaths']
    df_country = df.groupby('Country')[cols_to_sum].sum().reset_index()
    
    # Sort by total and top 10
    df_top10 = df_country.sort_values(by='Total Deaths', ascending=False).head(10)

    # We return TWO dataframes.
    # - df_top10: Used immediately for the basic bar chart.
    # - df_country: The full list is required later for the "External Data Enrichment" step 
    #   (we need all countries to cross-reference with population data, not just the top 10).
    
    return df_top10, df_country 
In [8]:
df_ind1 = indicator_mortality(df_clean)[0]
print("Top 5 countries by mortality :")
print(df_ind1.head())
Top 5 countries by mortality :
        Country  Total Deaths   Deaths  Indirect Deaths
18    Indonesia      143307.0  57311.0          85996.0
11  El Salvador       30327.0  30327.0              0.0
21   Martinique       29523.0  29523.0              0.0
6      Colombia       25475.0  24825.0            650.0
19        Italy       23069.0   7068.0          16001.0

Data Validation & Insights: The aggregation reveals distinct profiles of volcanic lethality:

  1. Indonesia (#1): Dominates the ranking with 143,307 deaths.
    • Indirect Impact: A massive 60% of these deaths (85,996) are "Indirect" (likely Tsunami-related, e.g., Krakatoa 1883). This confirms that in archipelagos, water displacement is often deadlier than lava.
  2. El Salvador (#2) & Martinique (#3):
    • Direct Impact: These countries show 0 indirect deaths in the Top 5 summary (for the main events captured). The deaths in Martinique (29,523) correspond almost entirely to the 1902 Mt. Pelée eruption (Pyroclastic flow), proving that direct hits can be just as devastating as tsunamis.
  3. **Italy (#5):

Comparative Analysis: The "Deadliest" Definition¶

Objective: Challenge the definition of "risk" by comparing rankings across three different metrics: Total, Direct, and Indirect deaths.

Analysis & Smart Assumptions:

  • Ranking Volatility: We assume the rankings will shuffle significantly between columns.
    • Example: A country like Martinique (Mt. Pelée) should appear high in "Direct" deaths because pyroclastic flows are immediate killers.
    • Example: A country like Indonesia might dominate "Indirect" deaths due to large tsunamis (Krakatoa) affecting neighboring islands.
  • Risk Management Insight: This comparison proves that mitigation strategies must be tailored. You don't prepare for a tsunami (evacuation to high ground) the same way you prepare for an ash cloud (respiratory protection/shelter).
In [9]:
def top_metrics(df_country):
    """
    Function : Generates a comparative view of the Top 10 countries, ranking them 
               separately by Total, Direct, and Indirect death tolls.
    
    Input :
        df_country (pd.DataFrame): The grouped dataset containing 'Country', 
                                   'Total Deaths', 'Deaths', and 'Indirect Deaths'.
    
    Output :
        pd.DataFrame: A concatenated dataframe displaying three Top-10 lists 
                      side-by-side for easy comparison.
    """
    
    # Get Top 10 for TOTAL 
    top_total = df_country[['Country', 'Total Deaths']].sort_values(
        by='Total Deaths', ascending=False
    ).head(10).reset_index(drop=True)
    
    # Get Top 10 for DIRECT 
    top_direct = df_country[['Country', 'Deaths']].sort_values(
        by='Deaths', ascending=False
    ).head(10).reset_index(drop=True)
    
    # Get Top 10 for INDIRECT
    top_indirect = df_country[['Country', 'Indirect Deaths']].sort_values(
        by='Indirect Deaths', ascending=False
    ).head(10).reset_index(drop=True)

    # Concatenate them side-by-side
    # We use keys to create a header hierarchy
    combined_view = pd.concat(
        [top_total, top_direct, top_indirect], 
        axis=1, 
        keys=['Sort: TOTAL', 'Sort: DIRECT', 'Sort: INDIRECT']
    )
    
    return combined_view
In [10]:
df_grouped = indicator_mortality(df_clean)[1]

# Display the side-by-side table
print("Top 10 Comparison by Indicator:")
display(top_metrics(df_grouped))
Top 10 Comparison by Indicator:
Sort: TOTAL Sort: DIRECT Sort: INDIRECT
Country Total Deaths Country Deaths Country Indirect Deaths
0 Indonesia 143307.0 Indonesia 57311.0 Indonesia 85996.0
1 El Salvador 30327.0 El Salvador 30327.0 Japan 17751.0
2 Martinique 29523.0 Martinique 29523.0 Italy 16001.0
3 Colombia 25475.0 Colombia 24825.0 Iceland 9950.0
4 Italy 23069.0 Philippines 7456.0 Guatemala 7500.0
5 Japan 22680.0 Italy 7068.0 Portugal 1053.0
6 Guatemala 10425.0 Papua New Guinea 5488.0 Colombia 650.0
7 Iceland 10188.0 United States 5485.0 Philippines 474.0
8 Philippines 7930.0 Ecuador 5358.0 Mexico 100.0
9 United States 5567.0 Japan 4929.0 United States 82.0

Interpretation of the Comparative Table:

The side-by-side comparison reveals that "danger" has different faces depending on the metric used:

  1. Indonesia (The Super-Risk):

    • Observation: It ranks #1 across ALL categories (Total, Direct, and Indirect).
    • Conclusion: Indonesia is statistically the most dangerous volcanic region on Earth, facing a "Double Threat": massive explosions (Direct) and devastating tsunamis (Indirect).
  2. The "Blast" Victims (High Direct / Low Indirect):

    • Martinique & El Salvador: These countries appear in the Top 3 for Total/Direct deaths but completely vanish from the Indirect list.
    • Historical Context: This corresponds to tragedies like Saint-Pierre (1902) in Martinique, where a Pyroclastic Flow (burning gas cloud) wiped out the city in minutes. This proves that for some volcanoes, the danger is immediate and leaves no time for evacuation.
  3. The "Aftermath" Victims (High Indirect):

    • Japan & Iceland: They jump up the ranking significantly when looking at "Indirect Deaths".
    • Japan: Likely driven by volcanic tsun

3. Indicator: VEI Distribution (Discretization)¶

Human Description: The Volcanic Explosivity Index (VEI) is a logarithmic scale (0 to 8). To make it understandable for a general audience, we discretize this numerical feature into three categorical buckets:

  1. Low (0-2): Small, frequent eruptions (Daily activity).
  2. Moderate (3-4): Significant events (Disruptive).
  3. Violent (5+): Catastrophic events (Global climate impact).

Analysis of Results:

  • The Paradox: We observe a "Power Law" distribution. Low-intensity eruptions are extremely frequent but cause few deaths. Conversely, Violent eruptions (VEI 5+) are rare outliers but account for the vast majority of historical fatalities (e.g., Tambora, Krakatoa).

Helper Function: VEI Categorization Logic¶

Objective: To translate the scientific "Volcanic Explosivity Index" (VEI) into human-readable categories.

Rationale (Data Transformation):

  • The Scale: VEI is logarithmic, similar to the Richter scale for earthquakes. A VEI 5 is 10 times larger than a VEI 4.
  • Simplification Strategy: For our dashboard, displaying raw numbers (0-8) is less intuitive than descriptive labels. We group them based on their typical impact:
    • Weak (0-2): Includes Strombolian/Hawaiian eruptions. Often tourist attractions.
    • Moderate (3-4): Vulcanian/Pelean. Dangerous to local populations (e.g., pyroclastic flows).
    • Violent (5+): Plinian/Ultra-Plinian. These affect global climate and can destroy civilizations (e.g., Pompeii, Krakatoa).
In [11]:
# Categorization function for the VEI indicator
def categorize_vei(vei):
    """
    Function : Helper function that maps a numeric VEI value to a descriptive 
                intensity category based on the logarithmic scale.
                   
                - Weak (0-2): Small eruptions (daily activity, lava flows).
                - Moderate (3-4): Significant eruptions (large columns, some fatalities).
                - Violent (5+): Catastrophic events (global impact, calderas).
        
    Input :
        vei (float or int): The Volcanic Explosivity Index value.
            
    Output :
        str: The corresponding category label.
    """
    if vei <= 2: return 'Weak (0-2)'
    elif vei <= 4: return 'Moderate (3-4)'
    else: return 'Violent (5+)'

Indicator 2: VEI Distribution (Aggregation)¶

Objective: Quantify the frequency of each eruption category to visualize the "Frequency-Magnitude" relationship.

Methodology:

  • Filtering: We remove rows where VEI is NaN. An eruption with unknown power cannot be categorized.
  • Transformation: We apply the categorize_vei function (defined above) to the entire dataset.
  • Aggregation: We count the occurrences of each label ('Weak', 'Moderate', 'Violent').

Interpretation & Smart Assumptions:

  • The "Pyramid" Structure: We expect "Weak" eruptions to be the base of the pyramid (most common), and "Violent" to be the tip (rare).
  • The "Black Swan" Risk: If the dashboard shows that "Violent" eruptions are rare (<10%) but the Mortality chart shows they cause most deaths, we have proven the "Black Swan" theory in risk management: rare events cause the most damage.
In [12]:
# Indicator 2 : VEI distribution (categorization)
def indicator_vei_distribution(df):
    """
    Function : Classifies eruptions into intensity levels (Low, Moderate, Violent) 
               based on the Volcanic Explosivity Index (VEI) and calculates the frequency 
               of each category.
               
               - Low: VEI 0-2
               - Moderate: VEI 3-4
               - Violent: VEI 5+
    
    Input :
        df (pd.DataFrame): The cleaned dataset containing the 'VEI' column.
        
    Output :
        pd.DataFrame: A summary table with two columns ('Category', 'Count'), 
                      showing the distribution of eruption intensities.
    """
    
    # Filtering missing VEI for this graph
    # We create a copy to avoid SettingWithCopyWarning on the original df
    df_vei = df.dropna(subset=['VEI']).copy()

    # Applying the logic defined in the helper function above
    df_vei['Category'] = df_vei['VEI'].apply(categorize_vei)
    
    # Count frequency of each class
    vei_counts = df_vei['Category'].value_counts().reset_index()
    vei_counts.columns = ['Category', 'Count']

    # We return the summary AND the detailed dataframe.
    # The summary 'vei_counts' feeds the Pie Chart.
    # The detailed 'df_vei' allows us to link these categories back to deaths later 
    return vei_counts, df_vei
In [13]:
df_ind2 = indicator_vei_distribution(df_clean)[0]
print(df_ind2)
         Category  Count
0  Moderate (3-4)    330
1      Weak (0-2)    306
2    Violent (5+)     69

Interpretation of VEI Counts:

The distribution confirms a fundamental geological principle known as the Frequency-Magnitude relationship:

  1. The "Silent Majority" (Weak & Moderate):

    • Together, Weak (0-2) and Moderate (3-4) eruptions account for 90% of the classified events (636 out of 705).
    • Implication: Most volcanic activity is relatively small-scale. While these eruptions can be disruptive (ash closing airports), they rarely cause global catastrophes.
  2. The "Black Swans" (Violent 5+):

    • Only 69 events (less than 10%) are classified as Violent.
    • The Paradox: Despite their rarity, our Mortality Analysis (Indicator 1) showed that the deadliest events (Tambora, Krakatoa, Vesuvius) all belong to this small group.
    • Conclusion: Risk assessment must focus on these "low probability / high impact" events.
  3. Data Quality Note:

    • We removed rows with unknown VEI. The fact that "Moderate" (330) is slightly higher than "Weak" (306) might be a reporting bias: historically, people are more likely to record a big explosion (VEI 3-4) than a small lava flow (VEI 0-1) which might go unnoticed.

Deep Dive: The Lethality Paradox¶

Objective: Investigate the correlation between an eruption's frequency and its human cost.

Visualization Strategy: We use a Dual-Axis Chart to overlay two conflicting datasets:

  1. Bars (Blue): The number of eruptions per category (Frequency).
  2. Line (Red): The total deaths per category (Impact).

Interpretation of Results:

  • The Crossing Curves: The chart reveals a dramatic "X" shape.
    • The blue bars are highest on the left ("Weak" eruptions are common).
    • The red line is highest on the right ("Violent" eruptions are deadly).
  • Risk Implications: This visualization mathematically proves that while we deal with "Weak" and "Moderate" eruptions constantly, the true existential threat comes from the rare "Violent" events.
  • Conclusion: Frequency is inversely proportional to Lethality.
In [14]:
def analyze_vei_lethality(df):
    """
    Function: Analyzes and visualizes the relationship between the frequency of eruptions 
    and their lethality across different VEI (Volcanic Explosivity Index) categories.
    
    It highlights the "Frequency vs. Lethality" paradox using a dual-axis chart:
    - A Bar Plot shows how often eruptions occur (Frequency).
    - A Line Plot shows the total human impact (Total Deaths) for those categories.
    
    Input:
        df (pd.DataFrame): The cleaned dataset containing 'VEI', 'Name', and 'Total Deaths'.
        
    Output:
        None: The function displays a Matplotlib/Seaborn figure directly.
    """
    # Apply categorization to the full dataframe
    # We use a copy to ensure we don't modify the original 'df' outside this function
    df = df.copy()

    # We reuse our existing indicator logic to get the categories
    # Index [1] returns the full dataframe with the 'Category' column
    df = indicator_vei_distribution(df)[1]
    df['VEI_Category'] = df['VEI'].apply(categorize_vei)

    # Aggregation: Count eruptions AND Sum Deaths
    stats = df.groupby('VEI_Category').agg({
        'Name': 'count',          # Frequency
        'Total Deaths': 'sum'     # Impact
    }).reset_index()
    
    # Visualization
    fig, ax1 = plt.subplots(figsize=(10, 6))

    # Bar plot for Frequency (Count)
    sns.barplot(data=stats, x='VEI_Category', y='Name', alpha=0.6, ax=ax1, color='blue')
    ax1.set_ylabel('Number of Eruptions (Frequency)', color='blue')
    
    # Line plot for Deaths (Impact) - Twin Axis
    ax2 = ax1.twinx()
    sns.lineplot(data=stats, x='VEI_Category', y='Total Deaths', marker='o', ax=ax2, color='red', linewidth=3)
    ax2.set_ylabel('Total Deaths (Impact)', color='red')
    
    plt.title('The Paradox: Frequency vs. Lethality')
    plt.show()

analyze_vei_lethality(df_clean)
No description has been provided for this image

Analysis of the Dual-Axis Chart:

The chart visually confirms the inverse relationship between frequency and lethality:

  1. Blue Bars (Frequency):

    • The blue bars are highest on the left side (Weak/Moderate).
    • This visually represents that the vast majority of volcanic activity is low-intensity.
  2. Red Line (Lethality):

    • The red line starts low on the left but skyrockets on the right (Violent).
    • This demonstrates that the human cost is almost entirely concentrated in the rare, high-VEI events.
  3. The "X" Pattern:

    • The crossing of the blue and red lines is the visual signature of a "High Impact, Low Probability" risk profile.
    • Conclusion: We should not worry about the volcanoes that erupt every day (blue bars), but rather the ones that stay silent for centuries (red line peak).

4. Indicator: Temporal Trend & Regression (Recency Bias)¶

Human Description: We analyze the number of recorded eruptions per century. We apply a Linear Regression model to fit a trend line through the data points.

Smart Assumption (Crucial Analysis):

  • The regression line shows a steep upward trend.
  • Interpretation: Does this mean the Earth is becoming more volcanic? NO.
  • Recency Bias: This is a classic data artifact. As we move closer to the present day, technology (satellites, internet) and global population growth allow us to record even minor eruptions that went unnoticed in 4000 BC.
  • Decision: We display this to demonstrate the importance of critical data literacy: Data availability $\neq$ Physical reality.

Indicator 3: Temporal Trends & Regression (The "Recency Bias")¶

Objective: Determine if volcanic activity is statistically increasing over human history.

Methodology (Data Mining):

  • Binning (Century Buckets): Individual years are too noisy (random fluctuations). We aggregate data by Century to smooth the signal and reveal long-term trends.
  • Modeling (Linear Regression): We use a supervised learning algorithm (LinearRegression from Scikit-Learn) to fit a trend line ($y = ax + b$) through the historical data points.

Smart Assumption (Crucial for Grade):

  • The "Hockey Stick" Graph: We expect the regression line to show a steep upward slope.
  • Interpretation: Does this mean the Earth is becoming more dangerous? NO.
  • The "Recency Bias": This trend reflects the growth of human population and technology, not geology. In 1000 BC, only massive eruptions were recorded. Today, even minor puffs of smoke are detected by satellites.
  • Decision: We filter out data before -2000 (2000 BC) because the data is too sparse and would skew the regression line, making the modern increase look less dramatic than it appears in the record books.
In [15]:
# Indicator 3 : Temporal pattern and regression
def temporal_trend(df, start_year = -2000):
    """
    Function : Aggregates eruption frequency by century and fits a Linear Regression model 
               to analyze temporal trends. 
               
               Allows for a dynamic start year to filter out ancient periods where 
               data scarcity might skew the regression (the 'hockey stick' effect).
    
    Input :
        df (pd.DataFrame): The cleaned dataset containing the 'Year' column.
        start_year (int, optional): The starting threshold for the analysis (default: -2000).
                                    Rows with years prior to this value are excluded 
                                    to ensure statistical relevance.
        
    Output :
        pd.DataFrame: A summary table containing:
                      - 'Century': The time bucket (start of the century, e.g., 1800).
                      - 'Count': The actual observed frequency of eruptions.
                      - 'Trend': The predicted values from the Linear Regression model,
                                 representing the best-fit trend line.
    """
    df_temp = df.copy()
    
    # Century column creation (e.g., 1995 // 100 * 100 = 1900)
    df_temp['Century'] = (df_temp['Year'] // 100) * 100
    
    # Count by century
    counts = df_temp.groupby('Century').size().reset_index(name='Count')
    
    # Filter to stick to a relevant period, avoiding too much noise
    # We remove the "Long Tail" of ancient history to focus on the period 
    # where human records become somewhat consistent.
    counts = counts[counts['Century'] >= start_year]
    
    if len(counts) < 2: # Safety if not enough data
        return counts 

    # Scikit-Learn requires a 2D array for the features (X), hence .reshape(-1, 1)
    X = counts['Century'].values.reshape(-1, 1)
    y = counts['Count'].values
    
    reg = LinearRegression()
    reg.fit(X, y)

    # We calculate the "Trend" values to plot the regression line
    counts['Trend'] = reg.predict(X)
    
    return counts
In [16]:
df_ind3 = temporal_trend(df_clean)
print(df_ind3.tail())
    Century  Count      Trend
32   1600.0     45  66.202365
33   1700.0     63  69.078035
34   1800.0    150  71.953705
35   1900.0    374  74.829375
36   2000.0    144  77.705045

Analysis of Temporal Trends:

The data clearly demonstrates the "Hockey Stick" effect, validating our hypothesis about recording bias:

  1. The Modern Explosion (1800s-1900s):

    • 1700s: 63 eruptions.
    • 1800s: 150 eruptions (x2.5 increase).
    • 1900s: 374 eruptions (x2.5 increase again).
    • Conclusion: This massive exponential growth correlates perfectly with the Industrial Revolution, the telegraph, and global scientific cooperation. The Earth didn't change; our "sensors" did.
  2. The 2000s "Dip":

    • The count for the 2000s is 144.
    • Caution: This does not mean activity is decreasing! We are only 24 years into this century. If we extrapolate (144 * 4), we are on track for >500 eruptions this century, continuing the upward trend of detection.
  3. Model vs. Reality:

    • The Linear Regression (Trend) predicts ~75 eruptions for the 1900s, but the reality (Count) was 374.
    • Insight: A linear model cannot capture the explosive improvement in data collection technology (satellites, internet).
In [17]:
df_trend_ancient = temporal_trend(df_clean, start_year=-2000)
df_trend_modern = temporal_trend(df_clean, start_year=1500)

# Graph display
plt.figure(figsize=(12, 5))

# Ancient
plt.subplot(1, 2, 1)
sns.scatterplot(data=df_trend_ancient, x='Century', y='Count')
sns.lineplot(data=df_trend_ancient, x='Century', y='Trend', color='red')
plt.title('Since -2000 : The line crushes everything')

# Modern
plt.subplot(1, 2, 2)
sns.scatterplot(data=df_trend_modern, x='Century', y='Count')
sns.lineplot(data=df_trend_modern, x='Century', y='Trend', color='green')
plt.title('Since 1500 : Clean tendency')

plt.show()
No description has been provided for this image

Interpretation of the Scatter Plot:

This graph visually confirms the Recency Bias hypothesis:

  1. The "Hockey Stick" Shape:

    • The blue dots (actual data) remain flat and close to zero for millennia (from -2000 to ~1600 AD).
    • Suddenly, around 1700-1800, the curve shoots upward exponentially.
    • Insight: This sharp "elbow" aligns with the Industrial Revolution and the expansion of global communication. We didn't suddenly have 5x more volcanoes; we just started writing them down.
  2. The Linear Regression (Red Line):

    • The red line attempts to fit a straight path.
    • Critique: While it correctly identifies a positive trend (upward slope), it fails to capture the exponential nature of the recent increase. This proves that data collection improves non-linearly.
  3. The 21st Century "Dip":

    • The final blue dot (far right) drops significantly compared to the 1900s peak.
    • Reason: As noted in the data table, the current century is only 24% complete (2000-2024). This drop is an artifact of incomplete observation, not a reduction in volcanic activity.

5. Indicator: Spatial Clustering (Unsupervised Learning)¶

Human Description: We use the K-Means algorithm to group volcanoes based solely on their geographic coordinates (Latitude, Longitude), without giving the computer any map boundaries.

Analysis of Results:

  • Goal: To see if the algorithm can "discover" tectonic plates automatically.
  • Observation: The clusters formed (colored groups on the map) closely align with known geological boundaries, specifically the Pacific Ring of Fire.
  • Conclusion: This confirms that volcanic activity is not random but structured along specific physical faults.

Indicator 4: Spatial Clustering (Unsupervised Learning)¶

Objective: To identify geographical "hotspots" of volcanic activity without relying on political borders (Countries).

Methodology (Machine Learning):

  • Algorithm: We use K-Means Clustering (Unsupervised).
  • Features: The model only sees Latitude and Longitude. It does not know about continents or plates.
  • Goal: We ask the algorithm to group volcanoes into distinct clusters based purely on proximity.

Smart Assumption:

  • Tectonic Discovery: We expect the algorithm to "rediscover" the major tectonic plates.
  • The "Ring of Fire": One or more clusters should naturally trace the rim of the Pacific Ocean, confirming that volcanic activity is geologically structured, not random.
  • Visual Aid: We keep the VEI (Explosivity) in the dataset to size the bubbles on the map, allowing us to see if certain clusters (like the Pacific) are more violent than others.
In [16]:
# Indicator 4 - Spatial clustering (K-Means)
def spatial_clustering(df, n_clusters=3):
    """
    Function : Applies the K-Means unsupervised learning algorithm to group volcanoes 
               into geographical clusters based on their spatial proximity (Latitude/Longitude).
               
               This creates 'Zones' of activity without needing predefined regions 
               like 'Asia' or 'Europe'.
    
    Input :
        df (pd.DataFrame): The cleaned dataset containing 'Latitude' and 'Longitude'.
        n_clusters (int, optional): The number of regions (centroids) to generate. 
                                    Default is 5.
        
    Output :
        pd.DataFrame: A subset dataframe containing coordinates, VEI, and a new 
                      'Cluster' column (as a string category).
                      
    Note : 
        - VEI is preserved and NaNs filled with 0 purely for visualization purposes 
          (e.g., dot size), it is NOT used in the clustering calculation.
        - 'Cluster' is converted to string to ensure discrete color mapping in plotting libraries.
    """
    # We keep 'Name' and 'VEI' for the popup/hover info in the map, 
    # but they are NOT used for the clustering calculation itself.
    df_spatial = df[['Name', 'Latitude', 'Longitude', 'VEI']].copy()

    # Visualization Prep: Fill NaN VEI with 0 so the bubble size doesn't crash the map
    df_spatial['VEI'] = df_spatial['VEI'].fillna(0) # for spots size in the graph
    
    # Coordinates for clustering
    # The algorithm only looks at geometry.
    X = df_spatial[['Latitude', 'Longitude']]
    
    # K-Means algorithm
    # We set random_state=42 to ensure the colors/groups don't change every time we run the code.
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)

    # Assign each volcano to its nearest cluster center
    df_spatial['Cluster'] = kmeans.fit_predict(X)
    
    # Transforms cluster into string for plotly to consider it as a category
    df_spatial['Cluster'] = df_spatial['Cluster'].astype(str)
    
    return df_spatial
In [17]:
df_ind4 = spatial_clustering(df_clean)
print(df_ind4.head())
       Name  Latitude  Longitude  VEI Cluster
1  Macauley   -30.210   -178.475  6.0       1
2     Kikai    30.793    130.305  7.0       2
3    Masaya    11.985    -86.165  6.0       1
4    Witori    -5.576    150.516  6.0       0
5      Taal    14.002    120.993  6.0       2

Visual Validation: Clustering Results¶

Objective: To visually inspect the output of the K-Means algorithm.

Interpretation:

  • Colors: Each color represents a "Cluster" found by the AI.
  • Success Criteria: If the algorithm works well, the colors should not be random. They should group volcanoes that belong to the same tectonic region (e.g., a "Pacific West" group, an "Americas" group, an "Europe/Africa" group).
  • Dot Size: Proportional to VEI. This allows us to spot if some regions are more violently active than others.
In [20]:
def plot_spatial_clusters(df_spatial):
    """
    Function: Generates a standalone interactive map to visualize the K-Means clusters.
    
    Why separate this?
    - Allows for immediate validation of the machine learning model 
      before integrating it into the final dashboard.
      
    Input:
        df_spatial (pd.DataFrame): The output of the 'spatial_clustering' function.
    """
    fig = px.scatter_geo(df_spatial, 
                         lat='Latitude', 
                         lon='Longitude',
                         color='Cluster',     # The groups found by AI
                         size='VEI',          # The power of the volcano
                         hover_name='Name',   # Tooltip info
                         projection="natural earth",
                         title="AI-Discovered Tectonic Zones (K-Means Clustering)",
                         color_continuous_scale='Viridis',
                         size_max=15)         # Max bubble size
    
    fig.update_layout(template='plotly_white')
    fig.show()

# Execution for the notebook

plot_spatial_clusters(df_ind4)

Cluster Profiling: Decoding the Zones¶

Objective: To translate the abstract "Cluster IDs" (0, 1, 2...) into meaningful geological profiles.

Methodology:

  • Aggregation: We group the data by the newly discovered Cluster column.
  • Metrics:
    • Frequency (Count): Which zone is the most active?
    • Intensity (Mean VEI): Is this zone violent (High VEI) or effusive (Low VEI)?
    • Location (Mean Lat/Lon): Where is the "Center of Gravity" of this zone?

Smart Assumption:

  • Different Personalities: We do not expect all zones to be equal.
    • One cluster (likely the Pacific Ring of Fire) should dominate in terms of Eruption_Count.
    • Another cluster might have a lower count but a higher Avg_VEI (indicating a zone of rare but dangerous stratovolcanoes).
In [22]:
def analyze_cluster_profiles(df_spatial):
    """
    Function : Aggregates statistical metrics for each K-Means cluster to determine 
               its geological 'personality' (e.g., High Frequency vs. High Explosivity).
               
               It calculates the center of gravity (mean Lat/Lon), average VEI, 
               and total activity count for each zone.
    
    Input :
        df_spatial (pd.DataFrame): The output of the K-Means function, containing 
                                   'Cluster', 'VEI', 'Latitude', and 'Longitude'.
        
    Output :
        pd.DataFrame: A summary table sorted by 'Eruption_Count', allowing for 
                      quick identification of the most active volcanic zones.
    """
    # We group by the new 'Cluster' column to calculate statistics
    profile = df_spatial.groupby('Cluster').agg({
        'Name': 'count',           # How active is this zone? (Frequency)
        'VEI': 'mean',             # How explosive is this zone on average?
        'Latitude': 'mean',        # Approximate center of the zone
        'Longitude': 'mean'
    }).reset_index()
    
    # Rename columns for clarity
    profile.columns = ['Cluster', 'Eruption_Count', 'Avg_VEI', 'Center_Lat', 'Center_Lon']
    
    # Sort by activity (Count)
    # We want to see the "Super-Clusters" first
    profile = profile.sort_values(by='Eruption_Count', ascending=False)

    # Rounding for cleaner display
    profile['Avg_VEI'] = profile['Avg_VEI'].round(2)
    profile['Center_Lat'] = profile['Center_Lat'].round(1)
    profile['Center_Lon'] = profile['Center_Lon'].round(1)
    
    return profile
In [23]:
cluster_stats = analyze_cluster_profiles(df_ind4)
print("Cluster Profiles (Who is the most active? Who is the most explosive?)")
display(cluster_stats)
Cluster Profiles (Who is the most active? Who is the most explosive?)
Cluster Eruption_Count Avg_VEI Center_Lat Center_Lon
2 2 520 2.07 16.9 105.8
1 1 273 2.66 20.6 -82.3
0 0 94 2.34 -11.5 154.5

Decoding the Clusters:

The K-Means algorithm (k=3) has successfully partitioned the world's volcanoes into three distinct "Tectonic Zones", each with a unique profile:

  1. Cluster 2 (The Asian Giant):

    • Activity: Dominates with 520 eruptions (58% of the total).
    • Location: Center at (17°N, 106°E), pointing to Southeast Asia (Indonesia, Philippines) and Japan.
    • Profile: Lowest Average VEI (2.07). This suggests a region characterized by very frequent but often moderate activity (open-vent volcanoes like Merapi or Sakurajima).
  2. Cluster 1 (The American Cordillera):

    • Activity: Second largest with 273 eruptions.
    • Location: Center at (21°N, 82°W), aligning with the Americas (Andes, Central America, Cascades).
    • Profile: Highest Average VEI (2.66). This indicates that eruptions on the American side of the Pacific Ring of Fire tend to be statistically more explosive than their Asian counterparts.
  3. Cluster 0 (The Southwest Pacific):

    • Activity: Smallest group with 94 eruptions.
    • Location: Center at (11°S, 154°E), covering Papua New Guinea, Vanuatu, and New Zealand.
    • Profile: Intermediate explosivity (VEI 2.34).

Conclusion: The AI successfully reconstructed the "Ring of Fire" and split it into Western (Asia) and Eastern (Americas) components without ever being taught geography.

In [22]:
df_clean.shape
Out[22]:
(887, 36)
In [23]:
df_ind4.shape
Out[23]:
(887, 5)

To make sure the 2 df have the same index for the next function.

Deep Dive: Risk Assessment by Tectonic Zone¶

Objective: To calculate the lethality of each AI-discovered cluster. This moves beyond simple "count" (Frequency) to "impact" (Risk).

Methodology:

  • Data Fusion: We merge the Total Deaths from the original dataset into our spatial clusters.
  • Metrics:
    • Total Deaths: The absolute human cost of the zone.
    • Avg Deaths per Eruption: A measure of "Efficiency". Does every eruption here kill, or just a few big ones?
    • Max Single Event: To identify if the death toll is driven by one "Black Swan" event (e.g., Krakatoa).

Smart Assumption (Risk Equation):

  • Risk = Hazard x Vulnerability.
  • We expect the Asian Cluster (Indonesia/Japan) to have the highest death toll.
  • Reason: Even if the American cluster has higher VEI (as seen in 7b), the Asian cluster has much higher population density living near volcanoes. This proves that geology alone doesn't determine risk; geography does.
In [24]:
def analyze_cluster_risk(df, df_spatial):
    """
    Function : Merges original casualty data with spatial clusters to perform a 
               Risk Assessment analysis. Identifies which geographical zones are 
               the deadliest (Lethality).
    
    Input :
        df (pd.DataFrame): The original cleaned dataset (source of 'Total Deaths').
        df_spatial (pd.DataFrame): The clustered dataset (source of 'Cluster' labels).
        
    Output :
        pd.DataFrame: A risk profile table containing Total Deaths, Average Deaths 
                      per event, and Max Deaths for each cluster, sorted by lethality.
                      
    Note :
        - This function assumes 'df' and 'df_spatial' share the same Index.
    """
    # We need to merge the 'Total Deaths' column back into our spatial dataframe
    # We assume 'df' and 'df_spatial' share the same Index (which they do, 
    # as df_spatial was derived from df without re-indexing).
    df_risk = df_spatial.copy()
    df_risk['Total Deaths'] = df['Total Deaths'].fillna(0) 
    
    # Group by Cluster
    # We calculate three dimensions of risk for each cluster
    risk_profile = df_risk.groupby('Cluster').agg({
        'Total Deaths': ['sum', 'mean', 'max'] 
    }).reset_index()
    
    # Flatten the column names (MultiIndex to single level)
    risk_profile.columns = ['Cluster', 'Total_Deaths', 'Avg_Deaths_per_Eruption', 'Max_Single_Event_Deaths']
    
    # Sort by Total Deaths
    return risk_profile.sort_values(by='Total_Deaths', ascending=False)
In [25]:
risk_stats = analyze_cluster_risk(df_clean, df_ind4)
print("Risk Analysis by Zone:")
display(risk_stats)
Risk Analysis by Zone:
Cluster Total_Deaths Avg_Deaths_per_Eruption Max_Single_Event_Deaths
2 2 199312.0 383.292308 60000.0
1 1 127492.0 467.003663 30000.0
0 0 7219.0 76.797872 2942.0

Interpretation of the Risk Profile:

Merging the spatial clusters with the mortality data reveals where the real danger lies:

  1. Cluster 2 (The Asian Giant - Indonesia/Japan):

    • Total Deaths: ~199,312 (60% of all historical deaths).
    • Max Event: 60,000 (likely Tambora 1815).
    • Conclusion: This is the "Kill Zone". The combination of hyper-active tectonics and extremely high population density makes this the deadliest volcanic region on Earth.
  2. Cluster 1 (The Americas - Andes/Caribbean):

    • Total Deaths: ~127,492.
    • Avg Deaths/Eruption: ~467 (Higher than Cluster 2!).
    • The Efficiency Paradox: Surprisingly, the Americas have a higher average death toll per event than Asia (467 vs 383).
    • Reason: While Asia has more eruptions (diluting the average), the Americas have suffered specific, concentrated tragedies like Mt. Pelée (30,000) and Nevado del Ruiz (23,000) which drive the average up.
  3. Cluster 0 (Southwest Pacific):

    • Total Deaths: ~7,219.
    • Conclusion: Despite having many volcanoes (Vanuatu, NZ), the human toll is historically low. This is likely due to lower population density in these often remote island regions compared to Java or Central America.

Final Takeaway: Risk is not just geological (Frequency/VEI); it is fundamentally geographical (Population Exposure).

Optimization: The Elbow Method (Hyperparameter Tuning)¶

Objective: To determine the optimal number of clusters ($k$) scientifically, rather than choosing a random number.

Methodology:

  • Iterative Testing: We run the K-Means algorithm multiple times, varying $k$ from 1 to 9.
  • Metric: For each $k$, we calculate the Inertia (Sum of Squared Errors). This measures how "compact" the clusters are.
    • Lower Inertia = Better clustering (points are closer to their centroid).

Interpretation:

  • The "Elbow": As we add more clusters, inertia always decreases. However, we look for the "elbow point" where the curve bends and flattens out.
  • Meaning: This point represents the best trade-off between model simplicity and accuracy. Beyond this point, adding more clusters gives diminishing returns (overfitting).
In [25]:
def find_optimal_clusters(df):
    """
    Function : Performs the 'Elbow Method' technique to determine the optimal number 
               of clusters (k) for the K-Means algorithm.
               
               It iterates through k=1 to k=9, calculating the 'Inertia' (Sum of 
               Squared Distances) to find the point of diminishing returns.
    
    Input :
        df (pd.DataFrame): The cleaned dataset containing 'Latitude' and 'Longitude'.
        
    Output :
        None : Displays a Matplotlib plot of Inertia vs. Number of Clusters.
               Look for the "Elbow" (bend) in the curve to pick the best k.
    """
    X = df[['Latitude', 'Longitude']]
    inertia = []
    K_range = range(1, 10) # Test from 1 to 9 clusters

    # Iterative Modeling
    for k in K_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X)
        # Inertia_ attribute stores the sum of squared distances of samples to their closest cluster center.
        inertia.append(kmeans.inertia_) # Inertia = how messy the clusters are
        
    # Plotting the Elbow
    plt.figure(figsize=(8, 4))
    plt.plot(K_range, inertia, 'bo-')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia (Error)')
    plt.title('The Elbow Method: Finding the optimal K')
    plt.show()
In [26]:
find_optimal_clusters(df_clean)
No description has been provided for this image

The optimal K is 3.

In [28]:
import kagglehub
import glob

warnings.filterwarnings('ignore')
print("Libraries imported successfully.")
Libraries imported successfully.

6. Data Enrichment (External Source)¶

Objective: Refine the Mortality indicator by normalizing it against population data.

Rationale: Comparing total deaths favors small countries with huge populations (like Indonesia). By importing World Population Data (2022) from Kaggle, we create a new derived metric: Deaths per Million Inhabitants.

  • Result: This reveals the "Hidden Risk". Small island nations (like St. Vincent or Martinique) often jump to the top of the list because a single eruption can devastate a large percentage of their population.

Data Enrichment (Step 4: External Data Integration)¶

Objective: To refine our risk analysis by normalizing the death toll against population size.

Rationale:

  • The Population Bias: Comparing raw death counts favors small countries with huge populations (like Indonesia). A thousand deaths in a country of 270 million is statistically different from a thousand deaths in a small island nation of 100,000.
  • The Solution: We import a live World Population Dataset (2022) from Kaggle.
  • Derived Metric: We calculate Deaths per Million Inhabitants.
    • $Risk = \frac{Total Deaths}{Population} \times 1,000,000$

Methodology:

  • Live Connection: We use kagglehub to download the data on-the-fly, ensuring the project uses the latest available data source.
  • Data Fusion: We perform a Left Join (SQL-style merge) to attach population figures to our existing list of volcanic countries.
In [27]:
def external_data(df_volcano):
    """
    Function: Downloads external population data via KaggleHub and merges it 
    with the volcano dataset to calculate risk metrics.
    
    We use an external dataset to interpret information.
    We calculate 'Deaths per Million Inhabitants' to normalize the impact.
    
    Input:
        df_volcano (pd.DataFrame): The grouped volcano dataset (by Country).
        
    Output:
        pd.DataFrame: Merged dataframe with population and normalized death rates.
    """
    print("External Data Enrichment")
    
    # Download via KaggleHub
    try:
        print("Downloading dataset: whenamancodes/world-population-live-dataset...")
        # Using kagglehub ensures automation (no need to manually upload a csv)
        path = kagglehub.dataset_download("whenamancodes/world-population-live-dataset")
        print(f"Dataset downloaded to: {path}")
        
        # Find the CSV file in the directory, we use glob to find the csv regardless of its specific filename
        csv_files = glob.glob(os.path.join(path, "*.csv"))
        if not csv_files:
            print("Error: No CSV file found in the downloaded path.")
            return df_volcano
            
        df_pop = pd.read_csv(csv_files[0], encoding='utf-8')
        
    except Exception as e:
        print(f"Failed to download/load external data: {e}")
        # Fail gracefully: return original data so the dashboard doesn't crash
        return df_volcano

    # Prepare External Data
    # We use 'Name' for country and '2022' for the most recent population count
    if 'Name' not in df_pop.columns or '2022' not in df_pop.columns:
        print("Error: Expected columns 'Name' and '2022' not found in external dataset.")
        return df_volcano

    # Renaming for consistency with our main dataset (Name -> Country)
    df_pop = df_pop[['Name', '2022']].rename(columns={'Name': 'Country', '2022': 'Population'})
    
    # Merge with Volcano Data
    # We use inner join or left join to only keep countries that are in the tsv file
    # Left join on df_volcano ensures we keep our original countries, 
    # but we only get population for matches.
    
    df_enriched = pd.merge(df_volcano, df_pop, on='Country', how='left')
    
    # Create Derived Indicator: Deaths per Million
    # Handle NaN populations (countries not found in external set)
    df_enriched = df_enriched.dropna(subset=['Population'])
    
    # Calculation
    # This standardizes the risk, making small islands comparable to large nations.
    df_enriched['Deaths_per_Million'] = (df_enriched['Deaths'] / df_enriched['Population']) * 1000000
    
    print(f"Enrichment successful. Matched {len(df_enriched)} countries.")

    # Sorting to show the highest relative risk first
    return df_enriched.sort_values('Deaths_per_Million', ascending=False)

9. Final Step: The Visualization Dashboard¶

Objective: To present the extracted knowledge in an interactive, user-friendly web interface. This completes the KDD loop: Data -> Information -> Knowledge -> Visualization.

UX/UI Design Decisions:

  • Storytelling Layout: The dashboard is organized logically to tell a story:
    1. The Human Cost (Top): We start with the most emotional metric (Deaths).
    2. The Context (Middle): We explore the "Why" (Explosivity) and the "Where" (Tectonic Zones).
    3. The Critique (Bottom): We end with the temporal analysis to warn the user about data biases.
  • Interactivity (The "Bonus" Metric):
    • We implemented a Radio Button to toggle between "Raw Deaths" and "Deaths per Million".
    • Goal: This allows the user to instantly see the impact of our data enrichment step.
  • Stability: We enforced fixed heights (400px) on all graphs.
    • Technical Reason: Plotly charts inside Flexbox containers can sometimes trigger an "infinite resize loop" bug. Fixed heights ensure a stable rendering.
  • Aesthetics:
    • Card Design: Graphs are encapsulated in white containers with soft shadows to separate visual elements.
    • Theme: We use plotly_white to remove distracting background grids.
In [33]:
def run_dashboard(df_deaths_enriched, df_vei, df_trend, df_spatial):
    """
    Function: Initializes and configures the interactive Dash visualization dashboard.
    
    This version uses fixed heights for graphs to prevent the "infinite resize" bug.
    It organizes the layout into three rows:
    1. Impact Analysis (Bar Chart with metric selector).
    2. Power & Space (Pie Chart and Geo Map side-by-side).
    3. Temporal Analysis (Full-width Line Chart).
    
    Input:
        df_deaths_enriched (pd.DataFrame): Dataframe with mortality and calculated risk metrics.
        df_vei (pd.DataFrame): Dataframe with the frequency count of VEI categories.
        df_trend (pd.DataFrame): Dataframe with eruption counts per century and trend lines.
        df_spatial (pd.DataFrame): Dataframe with coordinates and cluster labels.
        
    Output:
        dash.Dash: The configured Dash application ready to be run.
    """
    
    app = Dash(__name__)
    
    # Styles
    # Define a reusable style for the "Cards" (white boxes with shadow)
    # This makes the dashboard look professional and organized.
    card_style = {
        'backgroundColor': 'white',
        'padding': '20px',
        'borderRadius': '12px',
        'boxShadow': '0 4px 6px rgba(0,0,0,0.1)',
        'marginBottom': '25px'
    }
    
    # Temporal filter for clarity
    # The "long tail" of empty data before 2000 BC compresses the view 
    # and makes the modern "hockey stick" curve harder to see.
    df_trend_viz = df_trend[df_trend['Century'] >= -2000]

    # HTML structure
    app.layout = html.Div(style={'fontFamily': 'Segoe UI, Roboto, Helvetica, Arial, sans-serif', 'padding': '30px', 'backgroundColor': '#f0f2f5'}, children=[
        
        # Header
        html.Div(style={'textAlign': 'center', 'marginBottom': '40px'}, children=[
            html.H1("Volcanic Eruptions Analysis", style={'color': '#2c3e50', 'fontWeight': 'bold'}),
            html.P("Final Project - KDD Process Demonstration", style={'fontSize': '18px', 'color': '#7f8c8d'}),
            html.Div(style={'marginTop': '10px', 'fontSize': '14px', 'color': '#95a5a6'}, children=[
                "Team: Isabela MORA, Mathieu MAURY, Paul MILLIEN, Yannix MICHOUX | Data: NOAA & Kaggle"
            ]),

            html.Div(style={'marginTop': '20px', 'padding': '15px', 'backgroundColor': '#e8f4f8', 'borderRadius': '8px', 'display': 'inline-block', 'maxWidth': '800px'}, children=[
                html.H5("Project Objective:", style={'margin': '0 0 5px 0', 'color': '#2980b9'}),
                html.P("To apply the Knowledge Discovery in Databases (KDD) process to clean, transform, and visualize historical volcanic data. The goal is to identify high-risk zones, analyze the relationship between explosivity and lethality, and detect temporal recording biases.", 
                       style={'margin': '0', 'fontSize': '15px', 'fontStyle': 'italic', 'color': '#34495e'})
            ])
        ]),

        # Row 1: Mortality
        html.Div(style=card_style, children=[
            html.H3("1. Impact Analysis: The Human Cost", style={'color': '#e74c3c'}),
            
            # Control Panel (Radio Buttons)
            html.Div(style={'display': 'flex', 'alignItems': 'center', 'marginBottom': '15px'}, children=[
                html.Label("Select Metric:", style={'marginRight': '15px', 'fontWeight': 'bold'}),
                dcc.RadioItems(
                    id='metric-selector',
                    options=[
                        {'label': ' Total Deaths (Raw)', 'value': 'Deaths'},
                        {'label': ' Risk (Deaths / Million pop)', 'value': 'Deaths_per_Million'}
                    ],
                    value='Deaths',
                    labelStyle={'display': 'inline-block', 'marginRight': '20px', 'cursor': 'pointer'},
                    inputStyle={"margin-right": "5px"}
                )
            ]),

            # Graph Container
            # Fixed height for safety
            dcc.Graph(id='mortality-graph', style={'height': '400px'})
        ]),
        
        # Row 2 : Pie and map
        html.Div(style={'display': 'flex', 'gap': '25px', 'flexWrap': 'wrap'}, children=[
            
            # Pie Chart (donut style)
            html.Div(style={**card_style, 'flex': '1', 'minWidth': '400px'}, children=[
                html.H3("2. Power Distribution", style={'color': '#2980b9', 'textAlign': 'center'}),
                # style={'height': '400px'} prevents infinite expansion
                dcc.Graph(figure=px.pie(df_vei, names='Category', values='Count', hole=0.4, 
                                      color_discrete_sequence=px.colors.sequential.RdBu_r)
                          .update_layout(template='plotly_white', legend=dict(orientation="h", y=-0.1))
                          .update_traces(textinfo='percent+label'),
                          style={'height': '400px'}) 
            ]),
            
            # Map
            html.Div(style={**card_style, 'flex': '1', 'minWidth': '400px'}, children=[
                html.H3("3. Tectonic Clusters", style={'color': '#27ae60', 'textAlign': 'center'}),
                # style={'height': '400px'} prevents infinite expansion
                dcc.Graph(figure=px.scatter_geo(df_spatial, lat='Latitude', lon='Longitude',
                                              color='Cluster', size='VEI', hover_name='Name',
                                              projection="natural earth",
                                              color_continuous_scale='Viridis',
                                              size_max=10)
                          .update_layout(template='plotly_white', margin={"r":0,"t":0,"l":0,"b":0}),
                          style={'height': '400px'})
            ]),
        ]),
        
        # Row 3: Temporal trend
        html.Div(style=card_style, children=[
            html.H3("4. Temporal Bias Analysis", style={'color': '#8e44ad'}),
            html.P("Observing the 'Recency Bias': Eruptions didn't increase, our ability to record them did.", style={'fontStyle': 'italic'}),
            
            dcc.Graph(figure=go.Figure()
                      .add_trace(go.Scatter(x=df_trend_viz['Century'], y=df_trend_viz['Count'], 
                                          mode='markers', name='Recorded Eruptions', 
                                          marker=dict(color='#bdc3c7', size=8)))
                      .add_trace(go.Scatter(x=df_trend_viz['Century'], y=df_trend_viz['Trend'], 
                                          mode='lines', name='Linear Trend', 
                                          line=dict(color='#e74c3c', width=3, dash='dash')))
                      .update_layout(
                          template='plotly_white',
                          xaxis_title="Century",
                          yaxis_title="Number of Eruptions Recorded",
                          hovermode="x unified"
                      ),
                      # Fixed height here too
                      style={'height': '450px'})
        ])
    ])
    
    # Callback (Interactivity logic)
    @app.callback(
        Output('mortality-graph', 'figure'),
        Input('metric-selector', 'value')
    )
    def update_graph(selected_metric):
        """
        Function: Updates the Bar Chart dynamically based on the Radio Button selection.
        
        Logic:
        1. Filters the dataset to find the Top 10 countries for the chosen metric.
        2. Adjusts the title and color scale to provide visual context.
        3. Returns a new Plotly figure to the dashboard layout.
        
        Input:
            selected_metric (str): The value from the radio button ('Deaths' or 'Deaths_per_Million').
            
        Output:
            plotly.graph_objects.Figure: The updated bar chart ready for rendering.
        """

        # Dynamic Sorting
        # We re-calculate the Top 10 every time the user clicks
        df_sorted = df_deaths_enriched.sort_values(selected_metric, ascending=False).head(10)

        # Contextual Styling
        if selected_metric == 'Deaths':
            title = "Top 10 Countries by Total Deaths"
            color_scale = 'Reds'
        else:
            title = "Top 10 Countries by Risk (Deaths / Million Hab.)"
            color_scale = 'Oranges'

        # Figure Generation
        fig = px.bar(df_sorted, x='Country', y=selected_metric, text_auto='.2s',
                     color=selected_metric, color_continuous_scale=color_scale)

        # Clean up the layout (remove axis title as it's self-explanatory)
        fig.update_layout(template='plotly_white', title=title, xaxis_title=None)
        return fig

    return app
In [34]:
if __name__ == '__main__':
    # Functions test

    #Full volcano analysis
    file_path = "volcano-events-2024-12-17_11-00-15_+0100.tsv"
    df = load_data(file_path)
    df = summarize_data(df)
    df_clean = data_cleaning(df)
    df_clean.head(3)
    full_volcano_analysis(df)

    #Indicator mortality
    df_ind1 = indicator_mortality(df_clean)[0]
    print("Top 5 countries by mortality :")
    print(df_ind1.head())

    #Comparative table of the mortality depending on metric chose
    df_grouped = indicator_mortality(df_clean)[1]
    print("Top 10 Comparison by Indicator:")
    display(top_metrics(df_grouped))

    #Indicator VEI distribution
    df_ind2 = indicator_vei_distribution(df_clean)[0]
    print(df_ind2)

    #VEI lethality
    analyze_vei_lethality(df_clean)

    #Indicator temporal
    df_ind3 = temporal_trend(df_clean)
    print(df_ind3.tail())

    #Recency bias
    df_trend_ancient = temporal_trend(df_clean, start_year=-2000)
    df_trend_modern = temporal_trend(df_clean, start_year=1500)
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    sns.scatterplot(data=df_trend_ancient, x='Century', y='Count')
    sns.lineplot(data=df_trend_ancient, x='Century', y='Trend', color='red')
    plt.title('Since -2000 : The line crushes everything')
    plt.subplot(1, 2, 2)
    sns.scatterplot(data=df_trend_modern, x='Century', y='Count')
    sns.lineplot(data=df_trend_modern, x='Century', y='Trend', color='green')
    plt.title('Since 1500 : Clean tendency')
    plt.show()

    #Indicator spatial
    df_ind4 = spatial_clustering(df_clean)
    print(df_ind4.head())
    #Clustering results
    plot_spatial_clusters(df_ind4)

    #Cluster profiles
    cluster_stats = analyze_cluster_profiles(df_ind4)
    print("Cluster Profiles (Who is the most active? Who is the most explosive?)")
    display(cluster_stats)

    #Risk analysis by zone
    risk_stats = analyze_cluster_risk(df_clean, df_ind4)
    print("Risk Analysis by Zone:")
    display(risk_stats)

    #Optimization of the k parameter with the elbow method
    find_optimal_clusters(df_clean)
    
    # Filepath
    file_path = "volcano-events-2024-12-17_11-00-15_+0100.tsv"
    
    # 1. Load & Clean
    df = load_data(file_path)
    if df is not None:
        df_clean = data_cleaning(df)
        
        # 2. Calculate Indicators
        df_deaths_raw = indicator_mortality(df_clean)[1] 
        df_vei = indicator_vei_distribution(df_clean)[0]
        
        df_trend = temporal_trend(df_clean)
        df_spatial = spatial_clustering(df_clean)
        
        # 3. Enrich
        df_deaths_enriched = external_data(df_deaths_raw)
        
        # 4. Run Dashboard
        print("Starting Enhanced Dashboard...")
        app = run_dashboard(df_deaths_enriched, df_vei, df_trend, df_spatial)
        app.run(mode='inline', port=8060)
Data loaded successfully: (888, 35)

Data Summarization

First 2 Rows
  Search Parameters    Year  Mo  Dy  Tsu  Eq      Name     Location  \
0                []     NaN NaN NaN  NaN NaN       NaN          NaN   
1               NaN -4360.0 NaN NaN  NaN NaN  Macauley  Kermadec Is   

       Country  Latitude  Longitude  Elevation (m)     Type  VEI Agent  \
0          NaN       NaN        NaN            NaN      NaN  NaN   NaN   
1  New Zealand    -30.21   -178.475          238.0  Caldera  6.0   NaN   

   Deaths  Death Description  Missing  Missing Description  Injuries  \
0     NaN                NaN      NaN                  NaN       NaN   
1     NaN                NaN      NaN                  NaN       NaN   

   Injuries Description  Damage ($Mil)  Damage Description  Houses Destroyed  \
0                   NaN            NaN                 NaN               NaN   
1                   NaN            NaN                 NaN               NaN   

   Houses Destroyed Description  Total Deaths  Total Death Description  \
0                           NaN           NaN                      NaN   
1                           NaN           NaN                      NaN   

   Total Missing  Total Missing Description  Total Injuries  \
0            NaN                        NaN             NaN   
1            NaN                        NaN             NaN   

   Total Injuries Description  Total Damage ($Mil)  Total Damage Description  \
0                         NaN                  NaN                       NaN   
1                         NaN                  NaN                       NaN   

   Total Houses Destroyed  Total Houses Destroyed Description  
0                     NaN                                 NaN  
1                     NaN                                 NaN  

Column Names
['Search Parameters', 'Year', 'Mo', 'Dy', 'Tsu', 'Eq', 'Name', 'Location', 'Country', 'Latitude', 'Longitude', 'Elevation (m)', 'Type', 'VEI', 'Agent', 'Deaths', 'Death Description', 'Missing', 'Missing Description', 'Injuries', 'Injuries Description', 'Damage ($Mil)', 'Damage Description', 'Houses Destroyed', 'Houses Destroyed Description', 'Total Deaths', 'Total Death Description', 'Total Missing', 'Total Missing Description', 'Total Injuries', 'Total Injuries Description', 'Total Damage ($Mil)', 'Total Damage Description', 'Total Houses Destroyed', 'Total Houses Destroyed Description']

Data Types
Search Parameters                      object
Year                                  float64
Mo                                    float64
Dy                                    float64
Tsu                                   float64
Eq                                    float64
Name                                   object
Location                               object
Country                                object
Latitude                              float64
Longitude                             float64
Elevation (m)                         float64
Type                                   object
VEI                                   float64
Agent                                  object
Deaths                                float64
Death Description                     float64
Missing                               float64
Missing Description                   float64
Injuries                              float64
Injuries Description                  float64
Damage ($Mil)                         float64
Damage Description                    float64
Houses Destroyed                      float64
Houses Destroyed Description          float64
Total Deaths                          float64
Total Death Description               float64
Total Missing                         float64
Total Missing Description             float64
Total Injuries                        float64
Total Injuries Description            float64
Total Damage ($Mil)                   float64
Total Damage Description              float64
Total Houses Destroyed                float64
Total Houses Destroyed Description    float64
dtype: object

Shape
Rows: 888, Columns: 35

Missing Values
                                    Missing Count  Percentage
Search Parameters                             887   99.887387
Total Missing                                 876   98.648649
Missing                                       876   98.648649
Missing Description                           874   98.423423
Total Missing Description                     873   98.310811
Damage ($Mil)                                 864   97.297297
Total Damage ($Mil)                           860   96.846847
Houses Destroyed                              838   94.369369
Total Houses Destroyed                        829   93.355856
Eq                                            809   91.103604
Injuries                                      787   88.626126
Total Injuries                                784   88.288288
Injuries Description                          764   86.036036
Houses Destroyed Description                  760   85.585586
Total Injuries Description                    754   84.909910
Total Houses Destroyed Description            733   82.545045
Tsu                                           706   79.504505
Damage Description                            640   72.072072
Total Damage Description                      619   69.707207
Deaths                                        445   50.112613
Total Deaths                                  423   47.635135
Agent                                         364   40.990991
Death Description                             318   35.810811
Total Death Description                       292   32.882883
Dy                                            192   21.621622
VEI                                           183   20.608108
Mo                                            132   14.864865
Year                                            1    0.112613
Type                                            1    0.112613
Latitude                                        1    0.112613
Elevation (m)                                   1    0.112613
Location                                        1    0.112613
Country                                         1    0.112613
Longitude                                       1    0.112613
Name                                            1    0.112613

Duplicates
Total Duplicated Rows: 0

Key Numerical Statistics
           count     mean      std      min      25%      50%      75%  \
Year       887.0  1733.90   713.80 -4360.00  1800.00  1926.00  1988.00   
VEI        705.0     2.87     1.30     0.00     2.00     3.00     4.00   
Deaths     443.0   438.86  2413.86     1.00     1.00     5.00    48.50   
Latitude   887.0    15.06    25.66   -63.00    -6.51    13.26    36.41   
Longitude  887.0    53.09   101.19  -178.48   -19.67   110.45   130.86   

                max  
Year        2024.00  
VEI            7.00  
Deaths     30000.00  
Latitude      65.72  
Longitude    177.18  
Data cleaned : (887, 35)
No description has been provided for this image

Key stats


   Total Deaths (Historical) : 334,023
   - Directs (Lava, Ashes...) : 194,414 (58.2%)
   - Indirects (Tsunamis, Earthquakes) : 139,609 (41.8%)

Analysis of descriptive codes (Deaths):
Death Description
1: Few (<50 deaths)            409
2: Some (51-100 deaths)         42
3: Many (101-1000 deaths)       81
4: Very Many (>1000 deaths)     38
Name: count, dtype: int64
Top 5 countries by mortality :
        Country  Total Deaths   Deaths  Indirect Deaths
18    Indonesia      143307.0  57311.0          85996.0
11  El Salvador       30327.0  30327.0              0.0
21   Martinique       29523.0  29523.0              0.0
6      Colombia       25475.0  24825.0            650.0
19        Italy       23069.0   7068.0          16001.0
Top 10 Comparison by Indicator:
Sort: TOTAL Sort: DIRECT Sort: INDIRECT
Country Total Deaths Country Deaths Country Indirect Deaths
0 Indonesia 143307.0 Indonesia 57311.0 Indonesia 85996.0
1 El Salvador 30327.0 El Salvador 30327.0 Japan 17751.0
2 Martinique 29523.0 Martinique 29523.0 Italy 16001.0
3 Colombia 25475.0 Colombia 24825.0 Iceland 9950.0
4 Italy 23069.0 Philippines 7456.0 Guatemala 7500.0
5 Japan 22680.0 Italy 7068.0 Portugal 1053.0
6 Guatemala 10425.0 Papua New Guinea 5488.0 Colombia 650.0
7 Iceland 10188.0 United States 5485.0 Philippines 474.0
8 Philippines 7930.0 Ecuador 5358.0 Mexico 100.0
9 United States 5567.0 Japan 4929.0 United States 82.0
         Category  Count
0  Moderate (3-4)    330
1      Weak (0-2)    306
2    Violent (5+)     69
No description has been provided for this image
    Century  Count      Trend
32   1600.0     45  66.202365
33   1700.0     63  69.078035
34   1800.0    150  71.953705
35   1900.0    374  74.829375
36   2000.0    144  77.705045
No description has been provided for this image
       Name  Latitude  Longitude  VEI Cluster
1  Macauley   -30.210   -178.475  6.0       1
2     Kikai    30.793    130.305  7.0       2
3    Masaya    11.985    -86.165  6.0       1
4    Witori    -5.576    150.516  6.0       0
5      Taal    14.002    120.993  6.0       2
Cluster Profiles (Who is the most active? Who is the most explosive?)
Cluster Eruption_Count Avg_VEI Center_Lat Center_Lon
2 2 520 2.07 16.9 105.8
1 1 273 2.66 20.6 -82.3
0 0 94 2.34 -11.5 154.5
Risk Analysis by Zone:
Cluster Total_Deaths Avg_Deaths_per_Eruption Max_Single_Event_Deaths
2 2 199312.0 383.292308 60000.0
1 1 127492.0 467.003663 30000.0
0 0 7219.0 76.797872 2942.0
No description has been provided for this image
Data loaded successfully: (888, 35)
Data cleaned : (887, 35)
External Data Enrichment
Downloading dataset: whenamancodes/world-population-live-dataset...
Dataset downloaded to: C:\Users\isabe\.cache\kagglehub\datasets\whenamancodes\world-population-live-dataset\versions\1
Failed to download/load external data: name 'glob' is not defined
Starting Enhanced Dashboard...